Syllogism Conclusion Generator with GPT-2

Python
Deep Learning
NLP

Given two premises that form a valid syllogism, this autoregressive model can accurately complete the syllogism by generating a conclusion.

Author

Jake Gehri

Published

November 29, 2022

import torch
import pandas as pd
import numpy as np
from datasets import load_dataset, Dataset
from transformers import GPT2Tokenizer, GPT2LMHeadModel
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling

Introduction

This notebook is an extension of the last notebook, which classified whether two premises could be used to generate a valid conclusion. That model used a BERT architecture and was fine-tuned on the Avicenna syllogism dataset. This notebook uses the same dataset, but instead fine-tunes a GPT-2 model to take two premises as input and generate the corresponding conclusion.

I had to add custom start, end, and pad tokens in order to properly pad each input, so as not to be forced to randomly chop syllogisms into pieces and create input blocks of equal size.

device = 'cuda' if torch.cuda.is_available() else 'cpu'
file_name = 'Avicenna_Train.csv'
model_cp = "gpt2"
max_length = 200
# Register custom start, end, and pad tokens with the GPT-2 tokenizer
tokenizer = GPT2Tokenizer.from_pretrained(model_cp, bos_token='<startoftext>',
                                          eos_token='<endoftext>', pad_token='<pad>')
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False, return_tensors='pt')
model = GPT2LMHeadModel.from_pretrained(model_cp).to(device)
# Resize the embedding matrix to account for the newly added special tokens
model.resize_token_embeddings(len(tokenizer))
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Embedding(50260, 768)
def tokenize(batch):
    return tokenizer(batch['text'], truncation=True, max_length=max_length, padding='max_length')

The data came in a CSV file containing premise 1, premise 2, validity, and conclusion. I needed to filter the dataset, removing all invalid syllogisms, and then combine the premises and conclusion into a single string for fine-tuning. I found that telling the model which premise was which and where the conclusion started improved training. Additionally, adding a $ after the last premise slightly improved training, but this was done more to replicate what was done in the original GPT paper.

def prepare_dataset(file_name):
    dataset = load_dataset('csv', data_files=file_name, sep=',', encoding='ISO-8859-1')
    dataset.set_format(type='pandas')
    df = dataset['train'][:]
    # Keep only the rows that form a valid syllogism
    df = df[df['Syllogistic relation'] == 'yes']
    # Combine the premises and conclusion into a single training string
    df['text'] = '<startoftext>' + 'Premise 1: ' + df['Premise 1'] + 'Premise 2:' + df['Premise 2'] + '$' + 'Conclusion:' + df['Conclusion'] + '<endoftext>'
    df.reset_index(drop=True, inplace=True)
    df = df[['text']]
    dataset = Dataset.from_pandas(df)
    dataset = dataset.map(tokenize, batched=True, num_proc=4, remove_columns=["text"])
    return dataset
train_dataset = prepare_dataset('Avicenna_Train.csv')
test_dataset = prepare_dataset('Avicenna_Test.csv')
Using custom data configuration default-9e35c2288d530357
Reusing dataset csv (/root/.cache/huggingface/datasets/csv/default-9e35c2288d530357/0.0.0/51cce309a08df9c4d82ffd9363bbe090bf173197fc01a71b034e8594995a1a58)
Using custom data configuration default-80959f65edc13f7a
Reusing dataset csv (/root/.cache/huggingface/datasets/csv/default-80959f65edc13f7a/0.0.0/51cce309a08df9c4d82ffd9363bbe090bf173197fc01a71b034e8594995a1a58)
tokenizer.decode(train_dataset['input_ids'][0])
'<startoftext> Premise 1: Chronic diseases are heart attacks and stroke, cancer such as breast and colon cancer, diabetes, epilepsy and seizures, obesity, and oral health problems.Premise 2:In populations that eat a regular high-fiber diet of more than 50 grams of fiber per dayTrusted Source, like rural South Africans, chronic diseases are very low.$Conclusion:In populations that eat a regular high-fiber diet of more than 50 grams of fiber per dayTrusted Source, like rural South Africans, heart attacks and stroke, cancer such as breast and colon cancer, diabetes, epilepsy and seizures, obesity, and oral health problems are very low. <endoftext> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad>'
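
As a quick sanity check, the data collator (with mlm=False) copies the input ids into the labels for causal language modeling and masks the pad positions with -100 so they do not contribute to the loss. A small sketch, collating two of the tokenized examples:

batch = data_collator([train_dataset[i] for i in range(2)])
print(batch['input_ids'].shape)                # (2, max_length)
print((batch['labels'] == -100).sum().item())  # number of masked pad positions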

Training

model_name = model_cp.split("/")[-1]
training_args = TrainingArguments(
    f"{model_name}-finetuned-syllogism",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    push_to_hub=False,
)
PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    data_collator=data_collator
)
trainer.train()
/usr/local/lib/python3.9/dist-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
***** Running training *****
  Num examples = 2427
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 912
[912/912 04:18, Epoch 3/3]
Epoch   Training Loss   Validation Loss
1       No log          2.311477
2       2.267000        2.312688
3       2.267000        2.314481

***** Running Evaluation *****
  Num examples = 630
  Batch size = 8
Saving model checkpoint to gpt2-finetuned-syllogism/checkpoint-500
Configuration saved in gpt2-finetuned-syllogism/checkpoint-500/config.json
Model weights saved in gpt2-finetuned-syllogism/checkpoint-500/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 630
  Batch size = 8
***** Running Evaluation *****
  Num examples = 630
  Batch size = 8


Training completed. Do not forget to share your model on huggingface.co/models =)

TrainOutput(global_step=912, training_loss=2.215385637785259, metrics={'train_runtime': 258.8719, 'train_samples_per_second': 28.126, 'train_steps_per_second': 3.523, 'total_flos': 743151283200000.0, 'train_loss': 2.215385637785259, 'epoch': 3.0})
import math
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
***** Running Evaluation *****
  Num examples = 630
  Batch size = 8
[79/79 00:05]
Perplexity: 10.12

Testing

First, a classic example:

test = 'Premise 1: All men are mortal. Premise 2: Socrates is a man. $ Conclusion: '
input_ids = tokenizer(test, return_tensors='pt')['input_ids'].to(device)
output_greedy = model.generate(input_ids, max_length=25)
tokenizer.decode(output_greedy[0])
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
'Premise 1: All men are mortal. Premise 2: Socrates is a man. $ Conclusion:  Socrates is mortal'

I'm not sure why it generates this warning; I believe it has something to do with changing the model's special tokens at the beginning. In any case, the model was able to accurately generate the conclusion for this syllogism. This first example uses greedy search, where the model simply picks the most probable next token from its probability distribution over the vocabulary at each step.
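The warning message itself says that the attention mask and pad token id were not set. A minimal sketch of one way to silence it (assuming the same model and tokenizer as above) is to pass both to generate() explicitly:

inputs = tokenizer(test, return_tensors='pt').to(device)
output_greedy = model.generate(inputs['input_ids'],
                               attention_mask=inputs['attention_mask'],
                               pad_token_id=tokenizer.pad_token_id,
                               max_length=25)
tokenizer.decode(output_greedy[0])
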

output_beam = model.generate(input_ids, max_length=25, num_beams=5)
tokenizer.decode(output_beam[0])
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
'Premise 1: All men are mortal. Premise 2: Socrates is a man. $ Conclusion:  Socrates is mortal'

Beam search keeps the n most probable candidate sequences, or 'beams' (in this case 5), at each generation step, extending and pruning them as it moves through the sentence, and finally returns the candidate with the highest overall probability. This output also looks good; beam search will usually outperform greedy search.
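
To see the individual beams rather than just the top one, generate() can be asked to return every candidate along with its score; a small sketch, assuming the same input_ids as above:

output = model.generate(input_ids, max_length=25, num_beams=5,
                        num_return_sequences=5, output_scores=True,
                        return_dict_in_generate=True)
for seq, score in zip(output.sequences, output.sequences_scores):
    print(round(score.item(), 3), tokenizer.decode(seq, skip_special_tokens=True))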

The End

See below for more tests and decoding methods: temperature sampling, top-k sampling, and nucleus (top-p) sampling.

output_temp = model.generate(input_ids, max_length=25, do_sample=True, temperature = 0.5)
tokenizer.decode(output_temp[0])
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
'Premise 1: All men are mortal. Premise 2: Socrates is a man. $ Conclusion:  Socrates is mortal'
output_topk = model.generate(input_ids, max_length=25, do_sample=True, top_k=50)
tokenizer.decode(output_topk[0])
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
'Premise 1: All men are mortal. Premise 2: Socrates is a man. $ Conclusion:  Socrates is a'
output_topp = model.generate(input_ids, max_length=25, do_sample=True, top_p=0.90)
tokenizer.decode(output_topp[0])
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
'Premise 1: All men are mortal. Premise 2: Socrates is a man. $ Conclusion:  Socrates is mortal'
test2 = 'Premise 1: All mammals are animals. Premise 2: All elephants are mammals. $ Conclusion: '
input_ids = tokenizer(test2, return_tensors='pt')['input_ids'].to(device)
output_greedy = model.generate(input_ids, max_length = 50)
tokenizer.decode(output_greedy[0])
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
'Premise 1: All mammals are animals. Premise 2: All elephants are mammals. $ Conclusion:  All elephants are animals. <endoftext>                       '
output_beam = model.generate(input_ids, max_length = 50, num_beams=5)
tokenizer.decode(output_beam[0])
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
'Premise 1: All mammals are animals. Premise 2: All elephants are mammals. $ Conclusion:  All elephants are animals. <endoftext>                       '
test3 = 'Premise 1: All mammals are warm-blooded. Premise 2: All black dogs are mammals. $ Conclusion: '
input_ids = tokenizer(test3, return_tensors='pt')['input_ids'].to(device)
output_beam = model.generate(input_ids, max_length=40, num_beams = 5)
tokenizer.decode(output_beam[0])
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
'Premise 1: All mammals are warm-blooded. Premise 2: All black dogs are mammals. $ Conclusion:  All black dogs are warm-blooded. <endoftext>   <endoftext>  animal is warm-'
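
For convenience, the whole pipeline can be wrapped into a small helper. This is a hypothetical function (the name, the 60-token limit, and the sliced decoding are my own choices, not from the run above) that applies the same prompt template used for fine-tuning and returns only the generated conclusion:

def generate_conclusion(premise_1, premise_2, **generate_kwargs):
    # Mirror the prompt format used during fine-tuning
    prompt = f'Premise 1: {premise_1} Premise 2: {premise_2} $ Conclusion: '
    inputs = tokenizer(prompt, return_tensors='pt').to(device)
    output = model.generate(inputs['input_ids'],
                            attention_mask=inputs['attention_mask'],
                            pad_token_id=tokenizer.pad_token_id,
                            max_length=60,
                            **generate_kwargs)
    # Keep only the tokens generated after the prompt
    return tokenizer.decode(output[0][inputs['input_ids'].shape[1]:],
                            skip_special_tokens=True)

generate_conclusion('All mammals are animals.', 'All elephants are mammals.', num_beams=5)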